Abstract
Reshaping of arrays is a convenient programming primitive. For arrays encoded in a binary-reflected Gray code, reshaping implies a code change. We show that an axis splitting, or the combining of two axes, requires communication in exactly one dimension, and that for multiple axis splittings the exchanges in the different dimensions can be ordered arbitrarily. With large local data sets and concurrent communication, the number of element transfers in sequence is independent of the number of dimensions requiring communication. The lower bound for the number of element transfers in sequence is $K/2$ with $K$ elements per processor. We present algorithms that are of this complexity for some cases, and of complexity $K$ in the worst case. Conversion between binary code and binary-reflected Gray code is a special case of reshaping.


1  Introduction 

In computer systems locality of reference has had a significant impact on performance ever since memory hierarchies were introduced. In modern computer systems small memories in MOS technologies may be designed for higher speeds than larger memories. In multiprocessor systems with processors and memory modules interconnected via a network, the access time for non-local information is typically considerably longer than local access. Moreover, the access time depends upon the network topology, congestion and bandwidth of the communications network. The reference pattern has a significant impact on the optimal data allocation in networks that have a non-uniform distance between pairs of nodes, such as Boolean cube networks.

In well structured computations the data is conveniently represented by arrays. Many algorithms require local references in a Cartesian space corresponding to the array. Explicit methods for the solution of partial differential equations are examples thereof. Preserving the locality in the Cartesian space when mapped to the processor network is important with respect to performance. The binary-reflected Gray code is often used to accomplish this task in Boolean cube networks. Successive integers in the decimal encoding differ by one bit in their Gray code encoding. This property is used in CM-Fortran [1], Thinking Machines Corp.'s version of Fortran 8X [11] for the Connection Machine. In this language implementation, array axes are by default encoded in a binary-reflected Gray code.

Some important algorithms with a regular communication pattern depend on local references in a Boolean space. For instance, the Fast Fourier Transform requires communication in the form of a butterfly network, which implies communication between adjacent nodes in a Boolean space with corresponding nodes in different ranks mapped to the same processor. In many scientific and engineering applications algorithms that depend upon both types of access patterns may be used, and conversion between the two storage forms may be important.

Many recursive algorithms make use of axis splitting, or combining. An example is the data parallel implementation [2] of the divide-and-conquer algorithm by Dongarra and Sorensen [3] for computing eigenvalues of symmetric tridiagonal systems. Array manipulation through operations such as RESHAPE in Fortran 8X and APL impacts the encoding for binary-reflected Gray coded axes. The encoding of binary coded axes is unaffected.

Different axes may have different encoding. For instance, if butterfly computations are performed along one axis, and nearest-neighbor communications in a Cartesian space along the other axis of a two-dimensional array, then binary encoding of the first axis and binary-reflected Gray code encoding of the second axis is desirable. Furthermore, the encoding of a single axis may be mixed. Typically the number of array elements along an axis exceeds the number of processors allocated to the axis, forcing several elements along an axis to be allocated to the memory of each processor with the array elements being allocated as evenly as possible. Cyclic and consecutive [6] allocation are two common schemes for assigning multiple elements to processors. With local random access memories distance is not an issue in determining the encoding for the local memories. Binary encoding is typically used for the local part of an axis, and binary-reflected Gray code for the processor part.

As an example consider a two-dimensional logic array $A$ of shape $P \times Q$ allocated to an $N_1 \times N_0$ physical array of processors, where $P = 2^p$, $Q = 2^q$, $N_1 = 2^{n_1}$, $N_0 = 2^{n_0}$, $p > n_1$ and $q > n_0$. The data allocation is consecutive, and each array axis is encoded in a binary-reflected Gray code. Bit $m$ in the address space is denoted $g_m$ if encoded in a binary-reflected Gray code, and $b_m$ if encoded in binary code. Bit zero, or dimension zero, is the least significant, and the rightmost dimension in our expressions. The symbol $\|$ denotes concatenation of two fields. Axes are also labeled right to left. We illustrate the allocation as follows


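\[
(\underbrace{g^1_{p-1} g^1_{p-2} \cdots g^1_{p-n_1}}_{\text{paddr}^1}\,
 \underbrace{g^1_{p-n_1-1} \cdots g^1_0}_{\text{maddr}^1}
 \;\|\;
 \underbrace{g^0_{q-1} g^0_{q-2} \cdots g^0_{q-n_0}}_{\text{paddr}^0}\,
 \underbrace{g^0_{q-n_0-1} g^0_{q-n_0-2} \cdots g^0_0}_{\text{maddr}^0})
\]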

The processor address for an element $(i, j)$ of the logic array is formed as $(\text{paddr}^1(i) \| \text{paddr}^0(j))$, and the local storage address is $(\text{maddr}^1(i) \| \text{maddr}^0(j))$, where $G_p(i) = (g^1_{p-1} g^1_{p-2} \cdots g^1_0)$ is the binary-reflected Gray code encoding of $i$, and $G_q(j) = (g^0_{q-1} g^0_{q-2} \cdots g^0_0)$ is the binary-reflected Gray code encoding of $j$. Reshaping the logic array into a one-dimensional array such that $(i, j) \rightarrow iQ + j$, preserving the assignment of bits in the logic array to bits in the physical address space, implies a code conversion for axis zero if $i$ is odd, and data motion within $n_0$-dimensional subcubes. The result is an allocation of the form


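\[
(\underbrace{g_{p+q-1} g_{p+q-2} \cdots g_{p+q-n_1}}_{\text{paddr}^1}\,
 \underbrace{g_{p+q-n_1-1} g_{p+q-n_1-2} \cdots g_q}_{\text{maddr}^1}
 \;\|\;
 \underbrace{g_{q-1} g_{q-2} \cdots g_{q-n_0}}_{\text{paddr}^0}\,
 \underbrace{g_{q-n_0-1} \cdots g_0}_{\text{maddr}^0})
\]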

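where, as shown later, $g_{m+q} = g^1_m$, $m \in \{0, \ldots, p-1\}$ and $g_m = g^0_m$, $m \in \{0, \ldots, q-2\}$. Here the superscripts 1 and 0 mark the Gray code bits of axes one and zero before reshaping. The value of $g_{q-1}$ depends upon the value of $b_q$, the least significant bit of $i$. Figure 1 illustrates the data motion.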
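The claim that flattening $(i, j) \rightarrow iQ + j$ changes the code only for axis zero, and only when $i$ is odd, can be checked by brute force. The following is a minimal sketch; the function names `gray` and `flatten_changes_only_top_bit_of_axis_zero` are illustrative and not from the paper, and the check ignores the split into processor and memory fields.

```python
def gray(i):
    """Binary-reflected Gray code of i (g_m = b_m XOR b_{m+1})."""
    return i ^ (i >> 1)


def flatten_changes_only_top_bit_of_axis_zero(p, q):
    """Compare G_p(i) || G_q(j) with G_{p+q}(i*Q + j) for a 2^p x 2^q array.

    Returns True if the two addresses differ exactly in bit q-1 when i is odd
    and are identical when i is even. Requires q >= 1.
    """
    Q = 1 << q
    for i in range(1 << p):
        for j in range(Q):
            before = (gray(i) << q) | gray(j)   # two Gray coded axes, concatenated
            after = gray(i * Q + j)             # one Gray coded axis of length P*Q
            expected_diff = (1 << (q - 1)) if i % 2 == 1 else 0
            if before ^ after != expected_diff:
                return False
    return True


print(flatten_changes_only_top_bit_of_axis_zero(2, 2))
```

For the 4 x 4 case of Figure 1 (p = q = 2) the elements that move are those in odd rows, and each moves across dimension q - 1 = 1 only.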

Note that whereas the initial data allocation was consecutive, the data allocation after reshaping is not. If a consistent data allocation is desired, i.e., the same data allocation scheme before and after reshaping, then it is in general necessary to change the assignment of dimensions in the logic address space to dimensions in the physical address space. A dimension permutation [4,13,12,15,10,5] in the form of an $n_0$ step right cyclic shift, or $p - n_1$ steps left cyclic shift, on the dimensions in the field $(\text{maddr}^1 \| \text{paddr}^0)$ is required, in combination with code conversion.

Figure 1: Reshaping a 1 x 16 array to a 4 x 4 array.

With consecutive allocation of $A$ and a binary encoding of local addresses, and a binary-reflected Gray code encoding of processor addresses, the processor address of element $(i, j)$ is formed by computing the address from the binary-reflected Gray codes of $\lfloor i/N_1 \rfloor$ and $\lfloor j/N_0 \rfloor$. The local memory address is determined from the binary codes of $i \bmod N_1$ and $j \bmod N_0$. The encoding of the address field is

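\[
(\underbrace{g^1_{p-1} g^1_{p-2} \cdots g^1_{p-n_1}}_{\text{paddr}^1}\,
 \underbrace{b^1_{p-n_1-1} b^1_{p-n_1-2} \cdots b^1_0}_{\text{maddr}^1}
 \;\|\;
 \underbrace{g^0_{q-1} \cdots g^0_{q-n_0}}_{\text{paddr}^0}\,
 \underbrace{b^0_{q-n_0-1} \cdots b^0_0}_{\text{maddr}^0})
\]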


Reconfiguration of the processor array is equivalent to changing the assignment of dimensions in the logic address space to dimensions in the physical address space. A dimension permutation is required. If the encoding of the local address field is different from the processor address field, then a code conversion is required in combination with the dimension permutation. Reconfiguration of a processor array may be required to assure that all operands use the same physical machine configuration, as for instance in matrix multiplication on the Connection Machine [8]. The Connection Machine Fortran compiler allocates logic arrays to the processors by defining a processor array congruent to the logic array for each array. Hence, in the matrix multiplication $C \leftarrow A \times B$ all three matrices may assume a different shape of the processor array.

In this paper, we show how an axis splitting, or the combining of two axes into one, can be performed by a single exchange operation. For multiple axes split/merge operations, the number of element transfers in sequence is independent of the number of axes created or merged, if the communication system allows concurrent communication in all required dimensions. The number of element transfers in sequence is only a function of the size of the local data set, if there is a large local data set. The minimum number of element transfers in sequence is equal to the number of dimensions requiring communication. The conversion between binary-reflected Gray code and binary code is equivalent to reshaping between a one-dimensional array and a $2 \times 2 \times \cdots \times 2$ array of dimension $n$.

The algorithms we give for reshaping and code conversion are either asymptotically optimal, or optimal within a factor of two with respect to data transfer time. The control information can be computed locally from the node address. The code conversion can start in any dimension, and the required exchanges can be carried out in dimensions ordered arbitrarily. This property allows reshaping by concurrent communication in all required dimensions, if the size of the local data set exceeds the number of dimensions requiring communication. Compared to the algorithms in [6,7] the new algorithms avoid the pipeline delay. Here we only treat the case with an entire axis encoded in either binary code, or binary-reflected Gray code. Furthermore, we assume a fixed
assignment  of  dimensions  in  the  logic  address  space  to 
dimensions  in  the  physical  address  space.  Reshaping 
combined  with  dimension  permutations  is  considered  in 

The paper is organized as follows. Notation and definitions are introduced next. Array reshaping is discussed in Section 3. The conversion between binary-reflected Gray code and binary code is discussed in Section 4, followed by a summary in Section 5.


2  Preliminaries 


A Boolean $n$-cube has $N = 2^n$ nodes. Two nodes are adjacent if and only if their addresses differ in exactly one bit. The binary encoding of $i$ is $B_n(i) = (b_{n-1} b_{n-2} \cdots b_0)$ and its binary-reflected Gray code encoding is $G_n(i) = (g_{n-1} g_{n-2} \cdots g_0)$. $Z_N = \{0, 1, \ldots, N-1\}$ and $(1^j)$ is a string of $j$ instances of a bit with value one. "$\|$" is the concatenation symbol. For the complexity estimates we assume bi-directional channels and concurrent communication on all channels. The number of elements per node is $K$. $\mathcal{G}_n$ is the sequence of $n$-bit binary-reflected Gray codes for $Z_N$, i.e.,

\[ \mathcal{G}_n = (G_n(0), G_n(1), \ldots, G_n(2^n - 1)). \]

Definition 1 [14] The binary-reflected Gray code is defined recursively as follows.

\[ \mathcal{G}_1 = (G_1(0), G_1(1)), \quad \text{where } G_1(0) = 0,\ G_1(1) = 1. \]

\[
\mathcal{G}_{n+1} =
\begin{pmatrix}
0 \| G_n(0) \\
0 \| G_n(1) \\
\vdots \\
0 \| G_n(2^n - 2) \\
0 \| G_n(2^n - 1) \\
1 \| G_n(2^n - 1) \\
1 \| G_n(2^n - 2) \\
\vdots \\
1 \| G_n(0)
\end{pmatrix}
\]

In the following we always refer to the binary-reflected Gray code defined above.

Corollary 1 The highest order bit is the same in the binary code and the binary-reflected Gray code. The remaining bits in the encoding of $i \in Z_{N/2}$ are defined by $G_{n-1}((b_{n-2} b_{n-3} \cdots b_0))$. The remaining bits in the encoding of $i \in Z_N - Z_{N/2}$ are defined by $G_{n-1}((\bar{b}_{n-2} \bar{b}_{n-3} \cdots \bar{b}_0))$. Thus,

\[
G_n((b_{n-1} b_{n-2} \cdots b_0)) =
\begin{cases}
b_{n-1} \| G_{n-1}((b_{n-2} b_{n-3} \cdots b_0)), & \text{if } b_{n-1} = 0, \\
b_{n-1} \| G_{n-1}((\bar{b}_{n-2} \bar{b}_{n-3} \cdots \bar{b}_0)), & \text{if } b_{n-1} = 1.
\end{cases}
\]

Proof: From Definition 1. ∎

Corollary 2 The integer encoded in the neighbor of node $G_n(i)$ in cube dimension $j$ is $G_n(i \oplus (1^{j+1}))$, i.e., $G_n(i) \oplus 2^j = G_n(i \oplus (1^{j+1}))$.

Proof: It follows from Corollary 1. ∎

Definition 2 With binary-reflected Gray code encoding of an $N$-element one-dimensional array $A[i]$, $i \in Z_N$, into an $n$-cube, address $G_n(i)$ contains $A[i]$.

Lemma 1 [14] $b_m = g_{n-1} \oplus g_{n-2} \oplus \cdots \oplus g_m$, $m \in Z_n$. Conversely, $g_m = b_m \oplus b_{m+1}$, $m \in Z_n$, with $b_n = 0$.
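The definitions above translate directly into a few lines of code. The sketch below is an illustration, not part of the paper: `gray`, `gray_inverse` and `gray_sequence` are names chosen here, and the checks confirm on a small cube that the recursive construction of Definition 1, the bitwise formulas of Lemma 1, and the neighbor property of Corollary 2 agree.

```python
def gray(i):
    """Lemma 1, second part: g_m = b_m XOR b_{m+1}, applied to all bits at once."""
    return i ^ (i >> 1)


def gray_inverse(g):
    """Lemma 1, first part: b_m = g_{n-1} XOR g_{n-2} XOR ... XOR g_m."""
    i = 0
    while g:
        i ^= g
        g >>= 1
    return i


def gray_sequence(n):
    """Definition 1: prefix the (n-1)-bit sequence with 0, then its reversal with 1."""
    if n == 1:
        return [0, 1]
    prev = gray_sequence(n - 1)
    top = 1 << (n - 1)
    return prev + [top | code for code in reversed(prev)]


def neighbor_property_holds(n):
    """Corollary 2: G_n(i) XOR 2^j == G_n(i XOR (1^{j+1})) for every i and j."""
    for i in range(1 << n):
        for j in range(n):
            ones = (1 << (j + 1)) - 1      # the string (1^{j+1}) as an integer
            if gray(i) ^ (1 << j) != gray(i ^ ones):
                return False
    return True


n = 5
print(gray_sequence(n) == [gray(i) for i in range(1 << n)])
print(all(gray_inverse(gray(i)) == i for i in range(1 << n)))
print(neighbor_property_holds(n))
```

Corollary 2 is the property used repeatedly below: crossing cube dimension $j$ complements binary bits $j$ down to 0 of the index that the address encodes.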
Figure 2: Reshaping between two arrays with binary-reflected Gray code encoded on a Boolean cube.

Definition 3 Let $A$ be an array of shape $U_{d-1} \times U_{d-2} \times \cdots \times U_0$, $U = (U_{d-1}, U_{d-2}, \ldots, U_0)$, $U_m = 2^{u_m}$, $m \in Z_d$, $V = (V_{d'-1}, V_{d'-2}, \ldots, V_0)$, $V_m = 2^{v_m}$, $m \in Z_{d'}$, and $\prod_{m=0}^{d-1} U_m = \prod_{m=0}^{d'-1} V_m$. The reshape function $\rho(U, V)$ transforms the shape of the array $A$ from $U$ to $V$.

Let $\tilde{u}_k = (\sum_{m=0}^{k} u_m) - 1$, $\tilde{v}_k = (\sum_{m=0}^{k} v_m) - 1$, $\mathcal{U} = \{\tilde{u}_k \mid 0 \le k < d-1\}$, $\mathcal{V} = \{\tilde{v}_k \mid 0 \le k < d'-1\}$ and $\mathcal{D} = (\mathcal{U} \cup \mathcal{V}) - (\mathcal{U} \cap \mathcal{V})$. The sets $\mathcal{U}$ and $\mathcal{V}$ are the sets of most significant dimensions for the axes of the shapes $U$ and $V$, with the most significant axes excluded. For instance, if $U = (2^5, 2^2, 2^4, 2^3)$ and $V = (2^3, 2^2, 2^4, 2^5)$, then $\mathcal{U} = \{8, 6, 2\}$, $\mathcal{V} = \{10, 8, 4\}$ and $\mathcal{D} = \{10, 6, 4, 2\}$ (Figure 2). To form the shape $V$ from $U$ communication is required in the set of dimensions defined by $\mathcal{U} - \mathcal{V}$ for axes being combined into one, and the set of dimensions defined by $\mathcal{V} - \mathcal{U}$ for axes being split. $\mathcal{D}$ is the set of dimensions for which communication is required for changing the shape $U$ into $V$. $\mathcal{D}_{paddr}$ is the subset of dimensions in $\mathcal{D}$ assigned to processor dimensions in the physical address space, and $\mathcal{D}_{maddr} = \mathcal{D} - \mathcal{D}_{paddr}$ is the set of dimensions in $\mathcal{D}$ assigned to local memory dimensions in the physical address space.
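The sets of Definition 3 are simple prefix sums over the axis exponents. The sketch below computes them; the function names are illustrative, shapes are passed as exponent tuples with the most significant axis first, and the exponents (5, 2, 4, 3) and (3, 2, 4, 5) are an assumption chosen here because they reproduce the sets {8, 6, 2}, {10, 8, 4} and {10, 6, 4, 2} quoted in the example.

```python
def boundary_dims(exponents):
    """Most significant dimension of every axis except the most significant axis.

    `exponents` lists log2 of the axis lengths, most significant axis first,
    e.g. (5, 2, 4, 3) for the shape 2^5 x 2^2 x 2^4 x 2^3.
    """
    dims = set()
    total = 0
    for e in reversed(exponents[1:]):   # axes 0, 1, ..., d-2
        total += e
        dims.add(total - 1)
    return dims


def communication_dims(u, v):
    """Symmetric difference of the two boundary sets (the set D of Definition 3)."""
    if sum(u) != sum(v):
        raise ValueError("shapes must have the same number of elements")
    return boundary_dims(u) ^ boundary_dims(v)


u = (5, 2, 4, 3)
v = (3, 2, 4, 5)
print(sorted(boundary_dims(u), reverse=True))        # [8, 6, 2]
print(sorted(boundary_dims(v), reverse=True))        # [10, 8, 4]
print(sorted(communication_dims(u, v), reverse=True))  # [10, 6, 4, 2]
```

Dimension 8 is a boundary in both shapes, so it drops out of the symmetric difference and needs no communication.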


3  Reshaping  Arrays 

Lemma  2  below  states  the  fact  that  splitting  an  axis 
into  two,  or  merging  two  axes  into  one,  requires  a  code 
change  in  precisely  one  dimension. 

Lemma 2 Assume node $G_n(i)$ contains element $A[i]$, $i \in Z_N$, initially. If all nodes $i = (b_{n-1} b_{n-2} \cdots b_0)$ such that $b_m = 1$ exchange data in dimension $m - 1$ for any $m \in \{1, 2, \ldots, n-1\}$, then node $G_{n-m}((b_{n-1} b_{n-2} \cdots b_m)) \| G_m((b_{m-1} b_{m-2} \cdots b_0))$ contains element $A[i]$ after the exchange.

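Proof: Assume that the reshape operation is $U = (2^n) \rightarrow V = (2^{n-m}, 2^m)$, and that address $G_n(i) = (g_{n-1}(i) g_{n-2}(i) \cdots g_0(i))$ contains element $A[i]$. Let $i = k 2^m + \ell$, $\ell \in Z_{2^m}$, $k \in Z_{2^{n-m}}$. After the reshape operation element $i$, now $(k, \ell)$, should reside in address $(G_{n-m}(k) \| G_m(\ell))$, where

\[ G_{n-m}(k) = G_{n-m}((b_{n-m-1}(k) b_{n-m-2}(k) \cdots b_0(k))) = (g_{n-m-1}(k) g_{n-m-2}(k) \cdots g_0(k)) \]

and

\[ G_m(\ell) = G_m((b_{m-1}(\ell) b_{m-2}(\ell) \cdots b_0(\ell))) = (g_{m-1}(\ell) g_{m-2}(\ell) \cdots g_0(\ell)). \]

From the binary encoding, $b_j(k) = b_{m+j}(i)$, $j \in Z_{n-m}$, and $b_j(\ell) = b_j(i)$, $j \in Z_m$. By Lemma 1, $g_j(k) = b_j(k) \oplus b_{j+1}(k) = b_{m+j}(i) \oplus b_{m+j+1}(i) = g_{m+j}(i)$ for all $j \in Z_{n-m}$, and $g_j(\ell) = b_j(\ell) \oplus b_{j+1}(\ell) = b_j(i) \oplus b_{j+1}(i) = g_j(i)$ for all $j \in Z_{m-1}$. But, $g_{m-1}(\ell) = b_{m-1}(\ell) \oplus b_m(\ell) = b_{m-1}(i)$ and $g_{m-1}(i) = b_{m-1}(i) \oplus b_m(i)$, i.e.,

\[
g_{m-1}(\ell) =
\begin{cases}
g_{m-1}(i), & \text{if } b_m(i) = 0, \\
\bar{g}_{m-1}(i), & \text{if } b_m(i) = 1.
\end{cases}
\]

Hence, if $b_m(i) = 0$ then $G_n(i) = G_{n-m}(k) \| G_m(\ell)$ and no data motion is necessary for reshaping. But, if $b_m(i) = 1$ then an exchange is required in dimension $m - 1$, and only in dimension $m - 1$, since this dimension is the only dimension in which the code for $i$ and $(k, \ell)$ differ. ∎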
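Lemma 2 can be exercised on a small cube by moving the elements explicitly. The following sketch assumes one element per node; `split_axis` and `lemma2_holds` are illustrative names, not from the paper.

```python
def gray(i):
    """Binary-reflected Gray code of i."""
    return i ^ (i >> 1)


def split_axis(n, m):
    """Split a Gray coded axis of length 2^n into axes of length 2^(n-m) and 2^m.

    Initially node G_n(i) holds element i. Every element with b_m(i) = 1 is
    sent across cube dimension m-1. Returns {address: element index}.
    """
    after = {}
    for i in range(1 << n):
        addr = gray(i)
        if (i >> m) & 1:                 # control bit b_m of the element index
            addr ^= 1 << (m - 1)         # cross dimension m-1
        after[addr] = i
    return after


def lemma2_holds(n, m):
    """Check that element i = k*2^m + l ends up at G_{n-m}(k) || G_m(l)."""
    after = split_axis(n, m)
    if len(after) != 1 << n:             # moves must pair up into exchanges
        return False
    low_mask = (1 << m) - 1
    return all(addr == (gray(i >> m) << m) | gray(i & low_mask)
               for addr, i in after.items())


print(all(lemma2_holds(n, m) for n in range(2, 8) for m in range(1, n)))
```

Because no address is written twice, the one-way moves pair up into exchanges: the neighbor across dimension $m-1$ of a node whose element has $b_m = 1$ also holds an element with $b_m = 1$.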


The change in the binary-reflected Gray code caused by an axis splitting, or the merging of two axes, is limited to the most significant dimension of the lower ordered axis in the created pair of axes. The pairs of addresses exchanging content in a given dimension depend upon the order of exchanges in the case of multiple axes splittings. The control of the exchange is derived from $b_m$ in the encoding of $i$. The index $i$ assigned to an address changes if a more significant controlling dimension is one. For example, consider the reshaping of an array of 8 elements encoded in a binary-reflected Gray code to an array of $2 \times 2 \times 2$ elements (which is equivalent to conversion to binary code). Figure 3 shows exchanged data in boldface, and two exchange orders: dimension one then zero, or zero then one. As is apparent from Figure 3, an exchange is carried out in dimension zero between addresses 110 and 111 if the dimensions are treated in the order one first then zero, but not if the order is dimension zero first, then dimension one.

The current value of $b_m$ that is assigned to a given address $(g_{n-1} g_{n-2} \cdots g_0)$ is easily determined from the address.

Lemma 3 If the number of exchanges in dimensions more significant than $m$ is even, then the current value of logic dimension $m$ assigned to an address $G_n(i) = (g_{n-1} g_{n-2} \cdots g_0)$ is $b_m$. Otherwise it is $\bar{b}_m$.

The lemma follows directly from Corollary 2.

Half of the total number of elements need to be exchanged for any split/merge operation. Hence, the number of exchanges in which an element participates falls in the range 0 to $|\mathcal{D}|$, depending upon its binary encoding. The total number of element exchanges is $|\mathcal{D}| \frac{\prod_{m=0}^{d-1} U_m}{2}$ for changing shape $U$ to shape $V$. We will now determine the number of element exchanges in sequence when the logic array is allocated to an $n$-cube, with $\frac{\prod_{m=0}^{d-1} U_m}{N} = K$ elements per processor.

Figure 3: Reshaping an array of 8 elements into a 2 x 2 x 2 array.

Theorem 1 A lower bound for the number of element transfers in sequence for array reshaping affecting the encoding of processor dimensions is $K/2$ with $K$ elements per processor.

Proof: Pick a dimension $d \in \mathcal{D}_{paddr}$. There are $N/2$ processors that need to transfer data across dimension $d$. There are $K$ elements in each processor, and all elements need to be exchanged. The available bandwidth per dimension is $N$. ∎
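Written out, the counting argument in the proof divides the number of elements that must cross dimension $d$ by the number of elements that can cross it per step (one per direction on each of the $N/2$ bi-directional links):

\[
\frac{\tfrac{N}{2} \cdot K}{N} = \frac{K}{2}.
\]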

In the following, let $\delta = |\mathcal{D}_{paddr}|$.

Theorem 2 Changing the shape $U$ to shape $V$ preserving the assignment of logic dimensions to physical dimensions requires at most $\delta \lceil K/\delta \rceil$ element transfers in sequence with concurrent communication.

Proof: Let $\mathcal{D}_{paddr} = \{d_{\delta-1}, d_{\delta-2}, \ldots, d_0\}$. Partition the local data set of size $K$ into $\delta$ sets of size at most $\lceil K/\delta \rceil$ each. Label the data sets from 0 to $\delta - 1$. Each such set is assigned a sequence of dimensions including all dimensions in $\mathcal{D}_{paddr}$ once. Different sets are assigned different sequences such that no two sets have the same first, second, third, etc., dimension. For instance, let data in set $m$ be assigned the sequence of dimensions $d_m, d_{(m+1) \bmod \delta}, \ldots, d_{(m+\delta-1) \bmod \delta}$. ∎
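The cyclic assignment in the proof is a Latin square: in every step each data set uses a different dimension, and over all steps each data set visits every dimension once. A minimal sketch (function names are illustrative; dimensions are represented by their index into $\mathcal{D}_{paddr}$):

```python
def cyclic_schedule(delta):
    """schedule[m][t] = index of the dimension used by data set m in step t."""
    return [[(m + t) % delta for t in range(delta)] for m in range(delta)]


def is_conflict_free(schedule):
    """Every row and every column must be a permutation of 0..delta-1."""
    delta = len(schedule)
    full = set(range(delta))
    rows_ok = all(set(row) == full for row in schedule)
    cols_ok = all({schedule[m][t] for m in range(delta)} == full
                  for t in range(delta))
    return rows_ok and cols_ok


def transfers_in_sequence(K, delta):
    """delta steps, each moving one data set of at most ceil(K / delta) elements."""
    set_size = -(-K // delta)            # integer ceiling of K / delta
    return delta * set_size


print(cyclic_schedule(3))               # [[0, 1, 2], [1, 2, 0], [2, 0, 1]]
print(is_conflict_free(cyclic_schedule(5)))
print(transfers_in_sequence(12, 3), transfers_in_sequence(13, 3))  # 12 15
```

When $\delta$ divides $K$ the count is exactly $K$, independent of $\delta$.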

The  upper  bound  in  Theorem  2  differs  from  the  lower 
bound  by  a  factor  of  two.  The  upper  bound  can  be 
improved  in  some  cases.  We  give  upper  bounds  that  are 
almost  identical  to  the  lower  bounds  for  two  cases. 


Theorem 3 Changing the shape $U$ to shape $V$ preserving the assignment of logic dimensions to physical dimensions requires at most $\delta \lceil K/(2\delta) \rceil + 1$ element transfers in sequence with concurrent communication, if no two elements of $\mathcal{D}_{paddr}$ differ by one and $K > 2\delta$.

Proof: Consider the merging of a single pair of axes, or splitting of an axis. Assume the communication occurs in dimension $m - 1$. Consider a 2-cube formed by dimensions $m$ and $m - 1$, with its four nodes labeled by two bits such that, by Lemma 2, communication is only required between nodes 10 and 11. There exist two edge-disjoint paths between these two nodes of lengths one and three, respectively. By assigning $\lceil K/2 \rceil + 1$ elements to the path of length one and the remaining elements to the path of length three, $\lceil K/2 \rceil + 1$ element transfers in sequence are required.

If no two elements in $\mathcal{D}_{paddr}$ differ by one, then the 2-cubes used for different data sets are disjoint. Thus, $\delta(\lceil K/(2\delta) \rceil + 1)$ element transfers in sequence are required. To reduce the communication complexity to $\delta \lceil K/(2\delta) \rceil + 1$, we slightly overlap the communications on the successive 2-cubes of a given data set. Without this overlap no data is sent along the length-three path during the last two cycles of the routing of a data set. By sending two elements that have been routed with respect to the first 2-cube to the length-three path of the second 2-cube during the last two cycles of the routing phase of the first 2-cube (with one cycle each), the communication delay due to the length-three path is only paid once. Sending elements along the length-three path during the last two cycles of the first 2-cube will not interfere with the communication of the data set exchanged in the second 2-cube. The reduced complexity is valid if $\lceil K/\delta \rceil > 2$, i.e., some data set has at least three elements. ∎
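The split between the two paths in the single 2-cube case can be checked numerically. The sketch below rests on a modeling assumption made here, not stated in the paper in this form: $x$ elements pipelined over a path of length $L$ arrive after $x + L - 1$ transfer steps, and the two paths operate concurrently. The function name is illustrative.

```python
def best_two_path_time(K, short_len=1, long_len=3):
    """Minimum over all splits of K elements across two edge-disjoint paths.

    Assumed cost model: x > 0 elements pipelined along a path of length L
    take x + L - 1 steps; an unused path costs nothing.
    """
    best = None
    for on_short in range(K + 1):
        on_long = K - on_short
        t_short = on_short + short_len - 1 if on_short else 0
        t_long = on_long + long_len - 1 if on_long else 0
        t = max(t_short, t_long)
        if best is None or t < best:
            best = t
    return best


for K in (2, 3, 4, 10, 11):
    ceil_half = -(-K // 2)               # integer ceiling of K / 2
    print(K, best_two_path_time(K), ceil_half + 1)
```

Under this model the optimum equals $\lceil K/2 \rceil + 1$ for every $K \ge 2$, which is the count obtained by placing $\lceil K/2 \rceil + 1$ elements on the length-one path.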

In the routing used for the proof of the bound, the number of elements routed along the length-one path and the length-three path differ by two only for the first 2-cube. For subsequent 2-cubes, the same number of elements are routed along each path, with the length-three path starting two cycles earlier. The first element on both paths arrives at the same time within the 2-cube except for the first 2-cube. If $2\delta$ divides $K$ and $K > 2\delta$, then the complexity is $K/2 + 1$, which is only one element transfer above the lower bound. For $K < 2\delta$, there is no advantage of using the length-three paths over the algorithm used in the proof of Theorem 2.

If  the  reshape  operation  requires  communication  in 
dimensions  m  -  1  and  m  (by  creating  an  axis  of  length 
2  encoded  in  dimension  m),  then  dimension  m  cannot 
be  used  for  rerouting  to  access  unused  communication 
links  in  dimension  m  —  1.  Unused  links  in  dimensions 
lower  than  m  -  1  cannot  be  used  either,  since  they  do 
not  connect  to  processors  with  unused  links  in  dimension 
m  -  1.  However,  the  following  observation  can  be  used 
to  reduce  the  number  of  element  transfers  in  sequence. 
Lemma 4 For a reshape operation requiring communication in dimension $m - 1$ none of the links in dimension $m - 1$ is used in $m - 1$ dimensional subcubes obtained through complementing any of the address dimensions that are more significant than $m - 1$.

Proof: We need to show that in any $m - 1$ dimensional subcube defined by dimensions $m$ and higher, $b_m = 0$ if the address defining the subcube is obtained by complementing a single dimension of significance $m$ or higher. But, by Lemma 1 complementing a single dimension $g_j$, $j \in \{m, m+1, \ldots, n-1\}$ complements $b_m$. ∎

By  using  a  pipelined  algorithm  instead  of  the  non- 
pipelined  maximally  concurrent  algorithm  used  for  the 
upper  bound  in  Theorem  3,  the  properties  in  Lemma  4 
can  be  exploited  to  establish  the  following  bound. 

Theorem 4 Changing the shape $U$ to shape $V$ requires at most $\lceil K/2 \rceil + 2\delta - 1$ element transfers in sequence, if for each dimension requiring communication there exists one more significant dimension not requiring communication and $K > 2\delta$.

Proof: The problem is equivalent to sending $K$ elements along a path of length $\delta$ and each edge on the path is paired with a length-three path, disjoint with all other edges. If $\delta$ is even two edge-disjoint paths of length $2\delta$ can be defined by combining length-three and length-one paths for different dimensions. If $\delta$ is odd, then two paths of length $2\delta - 1$ and $2\delta + 1$ can be defined in a similar way. ∎

Several routing schemes yield the same complexity as the scheme used in the proof. For instance, by creating one path of length $\delta$ and one of length $3\delta$, and routing $\lceil K/2 \rceil + \delta$ elements along the short route and $\lfloor K/2 \rfloor - \delta$ elements along the long route the same routing time is achieved if $K > 2\delta$. For $K < 2\delta$, the latter approach degenerates to using a single path of length $\delta$ and the required time is $K + \delta - 1$, which is lower than if two paths of the same length were used. However, if $K < 2\delta$ then the time for reshaping by pipelining along one path is higher than, or at best the same as if the concurrent exchange algorithm in the proof of Theorem 2 is used.
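The balance between the two routes can be written out under the usual pipelining assumption (made explicit here) that $x$ elements sent along a path of length $L$ need $x + L - 1$ transfer steps:

\[
\underbrace{\left(\lceil K/2 \rceil + \delta\right) + \delta - 1}_{\text{short route, length } \delta}
= \lceil K/2 \rceil + 2\delta - 1,
\qquad
\underbrace{\left(\lfloor K/2 \rfloor - \delta\right) + 3\delta - 1}_{\text{long route, length } 3\delta}
= \lfloor K/2 \rfloor + 2\delta - 1 .
\]

Both routes therefore finish within $\lceil K/2 \rceil + 2\delta - 1$ steps, the bound of Theorem 4. With a single path of length $\delta$ the same model gives $K + \delta - 1$.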

Lemma 4 cannot be exploited directly for concurrent exchange sequences because an exchange in one dimension affects the set of edges being used in a subcube. This property follows from Lemma 3. For instance, if a 1 x 16 array is reshaped into a 4 x 2 x 2 array, then if an exchange in dimension one is performed first the required exchanges in dimension zero are all on corresponding links in different subcubes instead of complementary links.


4  Conversion between Gray code and binary code

Theorem 5 The conversion between a binary-reflected Gray code and binary code in either direction requires communication in $n - 1$ dimensions, and at most $(n-1) \lceil K/(n-1) \rceil$ element transfers in sequence.

Theorem 5 follows from Theorem 2 and the observation that conversion from binary-reflected Gray code to binary code in an $n$-cube is equivalent to reshaping a one-dimensional array of size $2^n$ to an $n$-dimensional array of shape $2 \times 2 \times \cdots \times 2$.

In any algorithm according to Lemma 2 and Theorem 5 only half of the communications links in each of the $n - 1$ dimensions are used in every step of the algorithm. Every path is of minimum length, and all minimum length paths are used evenly. The load on the communications network is minimal.

Conjecture 1 For the conversion between binary-reflected Gray code and binary code encodings of $K$ elements per processor in an $n$-cube, a lower bound is

For  n  =  2,  the  conjecture  follows  from  Theorem  3. 
For  n  >  2  only  the  most  significant  dimension  requires 
no  communication. 

Corollary 3 The conversion between binary-reflected Gray code and binary code encoding in an $n$-cube can be performed as an arbitrary sequence of communications in dimensions $\{0, 1, \ldots, n-2\}$.

The corollary follows from the observation that the control is completely determined by the binary encoding of $i$.

An algorithm proceeding from dimension $n - 2$ to dimension 0 is depicted in Figure 4. Initially, processor $G_4(i)$ contains data of index $i$. After the conversion, $i$ is assigned to processor $B_4(i)$. The algorithm is described below. Several other algorithms are given in [7].

/* Converting Gray code to binary code
   starting from the most significant dimension */
for d := n - 2 downto 0 do
    if g_{d+1} = 1 then
        exch. content with the neighbor in dim. d
    endif
enddo
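The conversion can be simulated on a small cube with one element per node. The sketch below is an illustration under that assumption, with function names chosen here. The first function follows the most-significant-first loop, taking the control from the current address bit $d + 1$. The second takes the control from bit $b_{d+1}$ of the stored index and accepts the dimensions in any order, which is the situation described by Corollary 3.

```python
from itertools import permutations


def gray(i):
    """Binary-reflected Gray code of i."""
    return i ^ (i >> 1)


def convert_msb_first(n):
    """Loop d = n-2 .. 0; a node exchanges in dim d if its address bit d+1 is 1."""
    holder = {gray(i): i for i in range(1 << n)}       # address -> stored index
    for d in range(n - 2, -1, -1):
        holder = {(addr ^ (1 << d)) if (addr >> (d + 1)) & 1 else addr: i
                  for addr, i in holder.items()}
    return holder


def convert_any_order(n, order):
    """Exchange in the given dimension order; control is b_{d+1} of the index."""
    holder = {gray(i): i for i in range(1 << n)}
    for d in order:
        moved = {}
        for addr, i in holder.items():
            target = addr ^ (1 << d) if (i >> (d + 1)) & 1 else addr
            moved[target] = i
        assert len(moved) == len(holder)               # moves pair up into exchanges
        holder = moved
    return holder


def is_binary_layout(holder):
    """After conversion, index i must sit at address B_n(i) = i."""
    return all(addr == i for addr, i in holder.items())


n = 4
print(is_binary_layout(convert_msb_first(n)))
print(all(is_binary_layout(convert_any_order(n, order))
          for order in permutations(range(n - 1))))
```

In the most-significant-first order the two controls coincide, because every address bit above the current dimension has already been converted to the binary bit of the stored index.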

The control in the above algorithm is particularly simple, since the following corollary follows from Lemma 3.

Corollary 4 If the conversion from binary-reflected Gray code to binary code proceeds from the most significant dimension to the least significant dimension, then the current value of $b_m$ assigned to an address is equal to $g_m$, where $m$ is the controlling dimension.

Figure 4: Conversion of a binary-reflected Gray code to binary code.

The algorithm is easy to generalize to an arbitrary starting dimension $m$, $m \in Z_{n-1}$, with exchanges in successive dimensions of decreasing order in a cyclic fashion. The first exchange requires the computation of $b_{m+1}$. Figure 5 gives an example. Sequence 2 is the same as in Figure 4. The figure shows the location of $i$ for each step of the algorithm for each sequence. For concurrent exchanges the local data set $K$ is divided into $n - 1$ sets, and set $m$, $m \in Z_{n-1}$, is subject to exchange in dimension $(n - 2 - m - t) \bmod (n - 1)$ during step $t$, $t \in Z_{n-1}$.

Figure 5: Concurrent conversion of a binary-reflected Gray code to binary code.

/* Converting Gray code to binary code starting from
   dimension m. Dimensions in decreasing order, cyclically */
if g_{n-1} ⊕ g_{n-2} ⊕ ··· ⊕ g_{m+1} = 1 then
    exch. content with the neighbor in dim. m
endif
for d := m - 1 downto 0 do
    if g_{d+1} = 1 then
        exch. content with the neighbor in dim. d
    endif
enddo
for d := n - 2 downto m + 1 do
    if g_{d+1} = 1 then
        exch. content with the neighbor in dim. d
    endif
enddo


5  Summary

We have shown that the splitting of a binary-reflected Gray code encoded axis into two binary-reflected Gray coded axes only requires an exchange in the most significant dimension of the lower order axis. The exchanges required for multiple axis splittings can be performed in arbitrary order.

Assume concurrent communication on all ports, $K$ elements per processor, and $\delta$ dimensions requiring communication for the reshape operation. If $K$ is a multiple of $\delta$, then the number of element transfers in sequence is independent of $\delta$. An upper bound is $K$ and a lower bound is $K/2$. We present three algorithms: (i) one of communication complexity $\delta \lceil K/\delta \rceil$, (ii) one of complexity $\delta \lceil K/(2\delta) \rceil + 1$ for reshape operations for which no two dimensions requiring communication are adjacent and $K > 2\delta$, and (iii) one of complexity $\lceil K/2 \rceil + 2\delta - 1$, if there is one unused processor dimension of higher order for every processor dimension requiring communication. The previously best known algorithm has a complexity of $K + \delta - 1$ [6].

The conversion between binary-reflected Gray code and binary code encodings is a special case of reshaping an array, and can be carried out on an $n$-cube by $n - 1$ exchanges in dimensions $0, 1, \ldots, n - 2$ in arbitrary order with a complexity of at most $(n-1) \lceil K/(n-1) \rceil$ element transfers in sequence.